EDA

From the check above, the dataset does not contain any duplicated rows.

There are many rows with missing data in the History column. Since this is customer data, the missing values may not be accidental. Sometimes data is left as NaN on purpose; for example, if a customer has never made a purchase at Bee Department Store, a NaN could indicate that they simply have no purchase history. However, the AmountSpent column records how much each customer has spent at Bee Department Store, which means every customer here has made a purchase.

Univariate Analysis

From the visualization above we know that:

Age

Gender

OwnHome

Married

Location

History

AmountSpent

From the visualization above we can get the following information:

Pandas Profile Report

Multivariate Analysis

Because we are going to compare relationships between categorical and continuous variables, we should use the Phik correlation. Not only can it depict the relationship between categorical and continuous variables, it can also catch non-linear relationships between variables.
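
As an illustration, below is a minimal sketch of computing the Phik matrix with the phik library; the DataFrame name `df` and the choice of interval columns are assumptions, not taken from the original notebook.

```python
# A minimal sketch, assuming the dataset is already loaded into a pandas DataFrame `df`
# and the phik package is installed (pip install phik).
import pandas as pd
import phik  # noqa: F401  # registers the .phik_matrix() accessor on DataFrames

# Treat the truly continuous columns as interval variables; the rest are handled as categorical.
corr_matrix = df.phik_matrix(interval_cols=["Salary", "AmountSpent"])
print(corr_matrix)
```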

the relationship between the customer’s Age and their Salary

As we can see from the heatmap, the customer's age has a strong correlation (0.66) with their salary.

From the boxplot above we can see that the relationship between age and salary is not linear but curved: salary starts lowest for the young group, peaks for the middle-aged group, and drops slightly for the old group. The young group has a median salary of 21,400, the middle-aged group 68,450, and the old group 54,600.

the customer’s marital status and their salary

From the heatmap, we can see that the customer's marital status has a very high correlation (0.85) with their salary.

Being married is positively correlated with the customer's salary: from the customer data, the median salary of married customers (76,700) is more than twice that of single customers (33,100).

the customer’s salary and their amount spent

From the heatmap we can see that the customer's salary has a high, positive correlation with the amount they spend at Bee Department Store: the higher the salary, the higher the amount spent.

Pre-processing

Things that need to be done in preprocessing, based on the EDA:

  1. Label Encoding: Age, History, AmountSpent, Location, Gender, OwnHome, Married. Age, History, AmountSpent, and Location have an order, so it makes sense to convert them to numerical values with a label encoder so the machine learning model can take them as input. Gender, OwnHome, and Married have no order, but they each have only 2 distinct values, so the result is the same whether we use a One-Hot encoder or a Label Encoder for them.
  2. Fill missing data in the "History" column. The proportion of missing rows is too large (above 30%) to simply drop them, and since the data is only missing in this one column there is no need to drop the column either.
  3. Split the data into train and test sets. Because we need to test how well our model performs on unseen data, we need a separate test set. We will use an 80:20 ratio because the dataset is small, so the model still has enough data to be trained well.
  4. Normalization: the scales of the numerical features are different. For example, Salary has values in the tens of thousands while Children only ranges from 1 to 4. Some models will not predict well with these features because they will treat Salary as more important than low-valued features such as Children. Therefore, we should bring these features to the same scale using normalization (MinMaxScaler).

Label Encoding
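
Below is a minimal sketch of the encoding step. The exact category labels in the mappings (e.g. "Young"/"Middle"/"Old") are assumptions about how the columns are coded, not values confirmed from the data; the AmountSpent target, if stored as ordered categories, can be encoded the same way.

```python
# A minimal sketch, assuming `df` holds the dataset; the category labels are assumptions.
ordinal_maps = {
    "Age": {"Young": 0, "Middle": 1, "Old": 2},
    "History": {"Low": 0, "Medium": 1, "High": 2},  # missing values stay NaN for the imputation step
}
binary_maps = {
    "Gender": {"Female": 0, "Male": 1},
    "OwnHome": {"Rent": 0, "Own": 1},
    "Married": {"Single": 0, "Married": 1},
    "Location": {"Close": 0, "Far": 1},
}

for col, mapping in {**ordinal_maps, **binary_maps}.items():
    df[col] = df[col].map(mapping)  # values outside the mapping (including NaN) become NaN
```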

Now that the features are in numerical form, the next step is to impute the missing data. However, because we want to use IterativeImputer, we should first split the data into X and y so there is no data leakage: the features will not be filled using the target as an input for the imputer's predictor.

Impute Missing Data

We will use IterativeImputer because no information is given about why the data is missing. IterativeImputer estimates each missing value from the other, non-missing features, so the filled-in values should be more accurate.
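
A minimal sketch of the imputation step, assuming `X` is the encoded feature DataFrame with the target already split off:

```python
import pandas as pd
# IterativeImputer is still experimental, so the enabling import is required first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=42)  # random_state chosen for reproducibility (assumption)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
```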

Because some imputed values are decimals or fall outside the valid range, we round them and clip them back into range.
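
For example, if History was encoded to the 0-2 range as assumed above, the imputed column can be rounded and clipped like this:

```python
# Round the imputed History values to whole numbers and clip them back into the valid 0-2 range.
X_imputed["History"] = X_imputed["History"].round().clip(0, 2).astype(int)
```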

Now the values are in the correct range.

Train Test Split

Because the dataset is small, we will maximize the size of the training set so the model can be trained on more data.
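
A minimal sketch of the split, assuming `X_imputed` and `y` from the previous steps; the stratify argument is an assumption to keep the class balance equal in both splits.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=42, stratify=y
)
```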

Normalization (MinMaxScaler)

To prevent data leakage, we split the data into train and test sets before normalizing with MinMaxScaler. The scaler is fitted only on the training data.
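
A minimal sketch of this step, fitting the scaler on the training split only:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the fitted scaler on the test data
```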

Now all of the features have the same range of values (from 0 to 1). This helps some models (such as logistic regression) in predicting the target.

Feature Selection

We can select the features to use for prediction based on their correlation with the target.

From the heatmap above we can see that all of the features have a decent correlation with the target. Therefore, we select all of them as features for our model. We can revisit which features to keep or drop after the model has been trained, using feature importance and SHAP values to see which features play a big role in predicting the target.

Modeling

Performance Metrics

Because the classes in the dataset are balanced, we can use accuracy as a performance metric. In addition, we will use the F1 score, the harmonic mean of precision and recall, to ensure the model performs well on every class. Since this is a multi-class classification problem, the F1 score needs an averaging method; we will use "macro" averaging because the classes are balanced.
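
A minimal sketch of the two metrics, assuming `y_test` and a model's predictions `y_pred`:

```python
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")  # macro average because the classes are balanced
print(accuracy, macro_f1)
```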

Baseline Model (DummyClassifier)

For the baseline model we choose DummyClassifier. A baseline model is useful as a control group because it predicts using only a simple rule of thumb; it shows how much better the trained models are in comparison.
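
A minimal sketch of the baseline, assuming the scaled splits from the pre-processing above; the "most_frequent" strategy is an assumption, not necessarily the strategy used in the original run.

```python
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent")  # predicts the majority class as a rule of thumb
baseline.fit(X_train_scaled, y_train)
print(baseline.score(X_test_scaled, y_test))
```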

Model Selection

I choose 3 models for this dataset: one simple model and two ensemble models. For the ensemble models I pick two different types, bagging and boosting. Each of these three models has its own characteristics, which will show which model is right for this dataset.

  1. The first model is the simple one, KNeighborsClassifier. It implements the K-nearest neighbors (KNN) algorithm: it classifies the target based on a vote among the nearest data points.
  2. The second model is an ensemble model that uses Bootstrap Aggregation (Bagging), namely RandomForestClassifier. It draws multiple random samples with replacement (bootstrap) from the dataset and uses them to train multiple homogeneous models (ensemble learning); in this case the homogeneous model is a decision tree. The trained trees then vote on the overall prediction. The idea of using this model is to reduce variance so the model does not overfit.
  3. The third model is an ensemble model that uses Boosting, namely XGBClassifier. It trains multiple homogeneous models sequentially: when the previous model makes mistakes, those mistakes are penalized and the next model focuses more on them. The idea of using this model is to reduce bias and prevent underfitting. A minimal instantiation sketch of these three models is shown after this list.
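
The sketch below instantiates the three models with default hyper-parameters and prints their train and test accuracy; it assumes the scaled splits from the pre-processing above.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "XGBClassifier": XGBClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(name, model.score(X_train_scaled, y_train), model.score(X_test_scaled, y_test))
```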

Before Hyper-parameters Tuning

For the training scores, KNeighborsClassifier is quite good, while RandomForestClassifier and XGBClassifier are superb, with all scores close to 1.0, meaning they predict the training targets almost perfectly. On the other hand, all of their predictions on the test data are poor: each scores under 0.7, which means the models are overfitting.

After Hyper-parameters Tuning

For the hyper-parameter tuning we will use RandomizedSearchCV, which randomly picks combinations of hyper-parameters and evaluates them using cross-validation (CV). Cross-validation splits the data into n folds, uses one fold as the test data and the rest as the training data, and repeats this until every fold has served as the test data. Unlike GridSearchCV, which tests every possible combination, RandomizedSearchCV only tries a fixed number of sampled combinations. After testing them, it returns the best hyper-parameter combination according to the scoring method we passed.

Here is a description of the hyper-parameters that I tune for each model:

KNeighborsClassifier

RandomForestClassifier

XGBClassifier

For the RandomizedSearchCV itself there are several parameters that I used:
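
As an illustration, here is a minimal sketch of the search setup for the RandomForestClassifier. The search space and the values of n_iter and cv are assumptions for the sketch, not the exact settings of the original run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [200, 400, 800],
    "max_depth": [20, 40, 80, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,           # number of sampled combinations (assumption)
    scoring="f1_macro",  # matches the metric chosen above
    cv=5,                # number of cross-validation folds (assumption)
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train_scaled, y_train)
print(search.best_params_, search.best_score_)
```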

(1) the meaning of the results

From the hyper-parameter tuning above we can see that the model scores improve as the models become less overfit. The RandomForestClassifier gives the best results in predicting the target (AmountSpent) for this dataset. The test accuracy of 0.72 means 72% of the predictions are correct, and the macro F1 score of 0.714, being the harmonic mean of precision and recall, means the model is capable enough to predict each class correctly.

The best model that we get is RandomForestClassifier with its hyper-parameters set to:

{'n_estimators': 800, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': True}

(2) the future work direction to improve the model

To improve the model further we could use GridSearchCV instead of RandomizedSearchCV: unlike RandomizedSearchCV, which only tests a limited number of hyper-parameter combinations, GridSearchCV tests every combination we list. We could also add more hyper-parameters and more candidate values for each one. In addition, the parameters of RandomizedSearchCV itself could be tweaked, such as increasing the number of CV folds to better ensure the model performs well on unseen data. Lastly, we could try more machine learning models, especially more advanced ones.

Aside from the model, we can also perform more feature engineering and pre-processing on the dataset. For example, transforming the continuous features so they are closer to a bell curve (normal distribution) improves the performance of some models; this can be done with feature transformations such as the log or Box-Cox transformation.

Cross Validation

To validate the model we will use cross-validation, to check that it performs well on unseen data. For this test we will use the entire dataset.
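
A minimal sketch of this check, assuming `best_rf` is the tuned RandomForestClassifier and `X_scaled`/`y` hold the whole pre-processed dataset (the variable names and cv=5 are assumptions):

```python
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(best_rf, X_scaled, y, cv=5, scoring="f1_macro")
print(cv_scores, cv_scores.mean())
```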

As we can see, the cross-validation scores do not differ much from the ones we got before. Hence, the model is robust and good at predicting unseen data.

Feature Importance
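
A minimal sketch of how such an importance ranking can be produced from the tuned forest, assuming `best_rf` and the feature DataFrame `X` from the earlier steps (matplotlib is required for the plot):

```python
import pandas as pd

importances = pd.Series(best_rf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", title="Feature importance")
```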

As we can see above, although every feature contributes something to predicting the target, some features are more important than others. Salary, History, and Catalogs are the main features used to predict the target, while OwnHome and Gender are considered much less important.

Customer Clustering Model

Pre-processing

From the visualization above we can see that there are outliers in Salary, and clustering is very sensitive to outliers. Hence, for this clustering model we need to do outlier handling. Outlier handling might also be needed for the categorical data; however, from the EDA done before there are no outliers in the categorical columns. Hence, we will only handle outliers in the continuous data, using the Z-score.

A clustering model is different from multi-class classification, so some of the required pre-processing steps are also different.

We already have the EDA from above, so we know what needs to be done for the clustering model. The pre-processing steps are:

  1. Label Encoding: Age, History, AmountSpent, Location, Gender, OwnHome, Married. Age, History, AmountSpent, and Location have an order, so it makes sense to convert them to numerical values with a label encoder so the model can take them as input. Gender, OwnHome, and Married have no order, but they each have only 2 distinct values, so the result is the same whether we use a One-Hot encoder or a Label Encoder for them.
  2. Fill missing data in the "History" column. The proportion of missing rows is too large (above 30%) to simply drop them, and since the data is only missing in this one column there is no need to drop the column either.
  3. Outlier Handling: needed so the clusters we build are accurate. It only needs to be done on the continuous data because the categorical columns have no outliers.
  4. Feature Selection based on RFM (Recency, Frequency, Monetary): for the clustering to be relevant and insightful, the chosen features also have to be relevant, so we use features that relate to customer segmentation.
  5. Standardization: the scales of the numerical features are different. For example, Salary has values in the tens of thousands while Children only ranges from 1 to 4. Some models will not work well with these features because they will treat Salary as more important than low-valued features such as Children. Therefore, we should bring these features to the same scale using standardization (StandardScaler).

Label Encoding

Impute Missing Data

We will use IterativeImputer because no information is given about why the data is missing. IterativeImputer estimates each missing value from the other, non-missing features, so the filled-in values should be more accurate.

Different from the multi-class classification before, there is no target column in this dataset yet. Hence, we can use all of the columns for the IterativeImputer so the imputed values will be more accurate.

Because some imputed values are decimals or fall outside the valid range, we round them and clip them back into range.

Now the values are in the correct range.

Outlier Handling
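
A minimal sketch of Z-score based outlier removal on the continuous columns, assuming `df_clust` is the encoded and imputed clustering DataFrame; the threshold of 3 is a common convention, and the list of continuous columns is an assumption.

```python
import numpy as np
from scipy import stats

continuous_cols = ["Salary", "AmountSpent"]
z_scores = np.abs(stats.zscore(df_clust[continuous_cols]))
df_clust = df_clust[(z_scores < 3).all(axis=1)]  # keep rows within 3 standard deviations on every continuous column
```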

Feature Selection based on RFM (Recency, Frequency, Monetary)

The features used in the clustering model are the ones that can be categorized under RFM, so that they stay relevant for customer segmentation (a short selection sketch follows this list). Those features are:

  1. "Salary": This value can be considered as monetary value because it shows how much the customer can spent on maximum.
  2. "Catalogs": This value can be considered as frequency because it shows how many catalogs that the customer get.
  3. "AmountSpent": This value can be considered as monetary because it shows how much money the customer had spent at Bee Department Store.
  4. "History": This value can be considered as Frequency because it shows how many items(volume) the customer has bought

Standardization

Standardization using StandardScaler scales the data so that all features share the same scale. Compared to normalization, standardization is more robust to outliers: normalization forces the data into the range 0 to 1, whereas standardization transforms the data to have a mean of 0 and a standard deviation of 1, so a remaining outlier distorts the scale less. Because clustering is sensitive to outliers, standardization is the more appropriate choice for feature scaling here.
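
A minimal sketch of the scaling, assuming `rfm_features` from the selection step above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_features)  # each feature now has mean 0 and standard deviation 1
```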

Modeling

For clustering we will use two algorithms: one of the simplest clustering algorithms, KMeans, and a more advanced one, DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

KMeans works by finding cluster centroids that minimize the distance of each data point to its centroid. The algorithm starts by randomly placing n centroids; each data point is then assigned to the nearest centroid. Next, the centroids are moved and the cluster assignments are recomputed. This process repeats until convergence (no data point changes cluster anymore).

Different from KMeans, DBSCAN has a concept of noise: a data point may not belong to any cluster. The algorithm estimates density using the nearest-neighbor idea; regions where the density is high are considered clusters, and as the density threshold decreases, neighboring high-density areas merge and are treated as one cluster.

For scoring the clustering we will use inertia and the silhouette score. Inertia will be used in the elbow method to help find the best number of clusters, while the silhouette score tells us directly how good the clustering is.

Kmeans Clustering

Because KMeans is an unsupervised clustering algorithm, we cannot use GridSearchCV to search for the best combination of hyper-parameters. However, we can use the elbow method and the silhouette score to find the best n_clusters.

In KMeans the hyper-parameters that we set are n_clusters and random_state:

Elbow Method
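
A minimal sketch of the elbow method, assuming `rfm_scaled` from the standardization step; the candidate range of cluster counts is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

cluster_range = range(2, 11)
inertias = []
for k in cluster_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(rfm_scaled)
    inertias.append(km.inertia_)

plt.plot(list(cluster_range), inertias, marker="o")
plt.xlabel("n_clusters")
plt.ylabel("Inertia")
plt.show()
```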

From the elbow method above we can see that the biggest change in inertia happens when the number of clusters reaches 3; after that the change becomes insignificant. Therefore, 3 is the best number of clusters.

Silhouette Score
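
A minimal sketch of computing the silhouette score for each candidate number of clusters, under the same assumptions as the elbow sketch:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(rfm_scaled)
    print(k, silhouette_score(rfm_scaled, labels))
```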

From the silhouette score it can also be seen that there is an increase in the silhouette value when the number of clusters is 3, meaning the objects are better matched to their own clusters; beyond 3 clusters the silhouette value drops. Considering the insight from the elbow method before, the best number of clusters is 3.

DBSCAN (Density-based Spatial Clustering of Applications with Noise)

Because clustering is unsupervised, we cannot directly measure how good the model is, so GridSearchCV cannot be used; however, we can write our own function for the hyper-parameter tuning. For DBSCAN the hyper-parameters that we will tune are:

Furthermore, we can use the silhouette score to indirectly find the best combination of hyper-parameters, because the silhouette value gauges an object's cohesion with its own cluster in comparison to other clusters (separation). A high silhouette value implies that the object is well matched to its own cluster and poorly matched to nearby clusters. The silhouette score has a range of -1 to +1.
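
A minimal sketch of a manual search over DBSCAN hyper-parameters guided by the silhouette score; the candidate values for eps and min_samples are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

best_params, best_score = None, -1.0
for eps in np.arange(0.3, 1.5, 0.1):
    for min_samples in range(3, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(rfm_scaled)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise points
        if n_clusters < 2:
            continue  # the silhouette score needs at least 2 clusters
        score = silhouette_score(rfm_scaled, labels)
        if score > best_score:
            best_params, best_score = (eps, min_samples), score
print(best_params, best_score)
```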

Same as with the KMeans clustering, DBSCAN also results in 3 clusters, with 1 additional cluster for the noise and outliers. This supports that the best number of clusters for our customer segmentation is 3.

Result Comparison

Recall that the silhouette value gauges an object's cohesion with its own cluster in comparison to other clusters (separation), and ranges from -1 to +1.

Both models' silhouette scores are around 0.52, which means the data points are placed adequately well: they are fairly well matched to their own cluster and poorly matched to other clusters. The results could still be improved by using more complex models and more in-depth pre-processing. In addition, the hyper-parameter tuning could be extended with various other hyper-parameters.

From the results above we can see that KMeans gives the better clustering of the data.

PCA
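
A minimal sketch of the projection, assuming `rfm_scaled` and the final KMeans cluster labels `labels` from the steps above (both names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
components = pca.fit_transform(rfm_scaled)
print(pca.explained_variance_ratio_.sum())  # fraction of variance explained by PC1 and PC2

plt.scatter(components[:, 0], components[:, 1], c=labels)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```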

From the PCA above we can see that 95% of the variance in the data is explained by the first two principal components. Because this is over 80%, we can use just PC1 and PC2 to represent our features and visualize them in a 2D plot.

Interpretation

From the cluster data we can see that customers in cluster 0 have a low salary, a low amount spent, and a small purchase volume at Bee Department Store. Therefore, this type of customer can be categorized as a low-spender customer.

From the cluster data we can see that customers in cluster 1 have a very high salary, a very high amount spent, and a very high purchase volume at Bee Department Store. This type of customer can be categorized as a high-spender customer.

From the cluster data we can see that customers in cluster 2 have a medium salary, a medium amount spent, and a medium purchase volume at Bee Department Store. This type of customer can be categorized as a medium-spender customer.